{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Tce3stUlHN0L"
      },
      "source": [
        "##### Copyright 2020 The TensorFlow Authors."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "tuOe1ymfHZPu"
      },
      "outputs": [],
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MfBg1C5NB3X0"
      },
      "source": [
        "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://www.tensorflow.org/io/tutorials/genome\"><img src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" />View on TensorFlow.org</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/io/blob/master/docs/tutorials/genome.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://github.com/tensorflow/io/blob/master/docs/tutorials/genome.ipynb\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on GitHub</a>\n",
        "  </td>\n",
        "      <td>\n",
        "    <a href=\"https://storage.googleapis.com/tensorflow_docs/io/docs/tutorials/genome.ipynb\"><img src=\"https://www.tensorflow.org/images/download_logo_32px.png\" />Download notebook</a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xHxb-dlhMIzW"
      },
      "source": [
        "## Overview\n",
        "\n",
        "This tutorial demonstrates the `tfio.genome` package that provides commonly used genomics IO functionality--namely reading several genomics file formats and also providing some common operations for preparing the data (for example--one hot encoding or parsing Phred quality into probabilities). \n",
        "\n",
        "This package uses the [Google Nucleus](https://github.com/google/nucleus) library to provide some of the core functionality. "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MUXex9ctTuDB"
      },
      "source": [
        "## Setup"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "IqR2PQG4ZaZ0"
      },
      "outputs": [],
      "source": [
        "try:\n",
        "  %tensorflow_version 2.x\n",
        "except Exception:\n",
        "  pass\n",
        "!pip install tensorflow-io"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "bkF2WtCMaJ-3"
      },
      "outputs": [],
      "source": [
        "import tensorflow_io as tfio\n",
        "import tensorflow as tf"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6wkjlql3cOy0"
      },
      "source": [
        "## FASTQ Data\n",
        "FASTQ is a common genomics file format that stores both sequence information in addition to base quality information.\n",
        "\n",
        "First, let's download a sample `fastq` file."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "yASvppCxceBu"
      },
      "outputs": [],
      "source": [
        "# Download some sample data:\n",
        "!curl -OL https://raw.githubusercontent.com/tensorflow/io/master/tests/test_genome/test.fastq"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3zekWXlVdprb"
      },
      "source": [
        "### Read FASTQ Data\n",
        "Now, let's use `tfio.genome.read_fastq` to read this file (note a `tf.data` API coming soon)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "vl761cHTc7N1"
      },
      "outputs": [],
      "source": [
        "fastq_data = tfio.genome.read_fastq(filename=\"test.fastq\")\n",
        "print(fastq_data.sequences)\n",
        "print(fastq_data.raw_quality)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qxHjVKXzdx5W"
      },
      "source": [
        "As you see, the returned `fastq_data` has `fastq_data.sequences` which is a string tensor of all sequences in the fastq file (which can each be a different size) along with `fastq_data.raw_quality` which includes Phred encoded quality information about the quality of each base read in the sequence.\n",
        "\n",
        "### Quality\n",
        "You can use a helper op to convert this quality information into probabilities if you are interested."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6IYxfFI4eQTM"
      },
      "outputs": [],
      "source": [
        "quality = tfio.genome.phred_sequences_to_probability(fastq_data.raw_quality)\n",
        "print(quality.shape)\n",
        "print(quality.row_lengths().numpy())\n",
        "print(quality)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bg3wzTFzhcfS"
      },
      "source": [
        "### One hot encodings\n",
        "You may also want to encode the genome sequence data (which consists of `A` `T` `C` `G` bases) using a one hot encoder. There's a built in operation that can help with this.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "oAiepmy8h32a"
      },
      "outputs": [],
      "source": [
        "one_hot = tfio.genome.sequences_to_onehot(fastq_data.sequences)\n",
        "print(one_hot)\n",
        "print(one_hot.shape)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "oAiepmy8h32a"
      },
      "outputs": [],
      "source": [
        "print(tfio.genome.sequences_to_onehot.__doc__)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [
        "Tce3stUlHN0L"
      ],
      "name": "genome.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
